feat: SIMD rendering pipeline + VSA 16384 migration + rasterizer intrinsics #112

Merged
AdaWorldAPI merged 8 commits into master from claude/teleport-session-setup-wMZfb
Apr 26, 2026

Conversation

@AdaWorldAPI (Owner) commented Apr 26, 2026

Summary

  • VSA migration: [u64; 157] / 10000-bit → [u64; 256] / 16384-bit (Binary16K). SIMD-clean at every precision tier.
  • hpc::renderer: SIMD double-buffer for SPO graph rendering. RenderFrame SoA + Renderer with atomic XOR swap + tick() FMA integration. Foveated rendering, adaptive FPS, LazyLock-cached splat constants.
  • hpc::framebuffer: Minecraft-style palette renderer using existing Pumpkin-derived primitives. Tier-adaptive palette (AVX-512=16, AVX2=8, NEON=4 colors). MRI / Neo4j / Cloud views. Wobble, neuron fire, glyph atlas, Amiga flyby ring buffer, pyramid shader.
  • U8x64 rasterizer intrinsics (seismon wishlist Tier 1+2): pairwise_avg, cmpgt_mask, mask_blend, shl_epi16, mask_store, saturating_add, permute_bytes. All three backends.

Test plan

  • 1630+ lib tests pass
  • 27 renderer tests, 30 framebuffer tests
  • 9 U8x64 rasterizer tests
  • cargo check clean

https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh

claude added 8 commits April 25, 2026 23:33
Document CodecSource, Provenance fields, Mode variants, PhaseDescriptor
fields, and OCR SIMD/felt types. No logic changes.

https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
…384-bit

P0 alignment with the canonical Binary16K format used in
lance-graph-contract::crystal::fingerprint. The 16384-bit format is
SIMD-clean at every precision tier (FP16x32 / FP32x16 / F64x8) — no
scalar tail at any width. Fixes the SIMD-alignment-sin documented in
lance-graph EPIPHANIES.md 2026-04-24.

Constants migrated:
  vsa.rs           VSA_DIMS  10_000 → 16_384
                   VSA_WORDS    157 →    256
                   VSA_BYTES   1250 →   2048
                   TAIL_BITS     16 →     64 (full word, no padding)
                   TAIL_MASK 0xFFFF → u64::MAX

  arrow_bridge.rs  SOAKING_DIMS       10000 → 16_384
                   SIGMA_MASK_BYTES    1250 →   2048
                   DEFAULT_SOAKING_DIM 10000 → 16_384

  deepnsm.rs       nsm_to_fingerprint -> [u8; 1250] → [u8; 2048]
                   XOR loop: 19 SIMD chunks + 34 scalar tail
                            → 32 SIMD chunks (no tail, fully aligned)
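
For illustration, the arithmetic behind the migrated constants can be checked with a small sketch. `vsa_layout` and `xor_chunks` are hypothetical helper names introduced here, not functions from the crate; the 64-byte chunk width is an assumption matching a 512-bit register.

```rust
/// Layout of a `dims`-bit vector stored in u64 words:
/// (words, bytes, tail_bits, tail_mask).
pub const fn vsa_layout(dims: usize) -> (usize, usize, u32, u64) {
    let words = (dims + 63) / 64;
    let bytes = (dims + 7) / 8;
    let rem = (dims % 64) as u32;
    if rem == 0 {
        // Last word is fully used: no padding bits to mask off.
        (words, bytes, 64, u64::MAX)
    } else {
        // Only the low `rem` bits of the last word carry data.
        (words, bytes, rem, (1u64 << rem) - 1)
    }
}

/// Split a byte length into (full 64-byte chunks, leftover scalar bytes).
pub const fn xor_chunks(bytes: usize) -> (usize, usize) {
    (bytes / 64, bytes % 64)
}
```

With these helpers, `vsa_layout(10_000)` yields the old constants and `vsa_layout(16_384)` the new ones, and `xor_chunks` reproduces the 19 + 34 versus 32 + 0 split quoted for deepnsm.rs.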

Tests updated:
  vsa.rs::test_constants — assert new values
  arrow_bridge.rs::schema_constants — assert new values
  arrow_bridge.rs sigma_mask len assertions — 1250 → 2048, 10000 → 16384

Test results: 1619 lib tests pass, 0 failed (full suite).

https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
The hardware-acceleration mothership for q2 cockpit / Palantir Gotham.
Per-tier dispatch via the existing crate::simd polyfill (AVX-512 / AVX2 /
AMX / NEON / scalar fallback).

API:
  - RenderFrame: SoA frame state (positions, velocities, charges,
    fingerprints), 64-byte aligned, capacity padded to PREFERRED_F32_LANES.
  - Renderer: double-buffer with atomic front/back swap (AtomicUsize XOR).
    read_front() for REST/SSE consumers; write_back() for shader cycle.
  - tick(dt, damping): SIMD-FMA velocity integration on back buffer
    (`v.mul_add(dt_v, p)` per chunk), then atomic swap.
  - GLOBAL_RENDERER: process-global LazyLock<Renderer> (4096 nodes).
  - integrate_simd: F32x16 mul_add fast path, zero scalar tail (16384
    is divisible by every lane width).
  - apply_uniform_force: per-axis acceleration via FMA.
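
A minimal sketch of the double-buffer idea described above, assuming two slots guarded by `RwLock` and a front index toggled with `fetch_xor`. The type `DoubleBuffer` and its locking strategy are assumptions for illustration; the actual `Renderer` may be structured differently.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{RwLock, RwLockReadGuard, RwLockWriteGuard};

/// Hypothetical two-slot double buffer; `front` always holds 0 or 1.
pub struct DoubleBuffer<T> {
    slots: [RwLock<T>; 2],
    front: AtomicUsize,
}

impl<T: Clone> DoubleBuffer<T> {
    pub fn new(initial: T) -> Self {
        Self {
            slots: [RwLock::new(initial.clone()), RwLock::new(initial)],
            front: AtomicUsize::new(0),
        }
    }

    /// Readers see the currently published frame.
    pub fn read_front(&self) -> RwLockReadGuard<'_, T> {
        self.slots[self.front.load(Ordering::Acquire)].read().unwrap()
    }

    /// The writer mutates the other slot.
    pub fn write_back(&self) -> RwLockWriteGuard<'_, T> {
        self.slots[self.front.load(Ordering::Acquire) ^ 1].write().unwrap()
    }

    /// XOR with 1 toggles 0 <-> 1 atomically. Returns the new front index.
    pub fn swap(&self) -> usize {
        self.front.fetch_xor(1, Ordering::AcqRel) ^ 1
    }
}
```

The XOR toggle keeps the swap a single atomic read-modify-write, so readers never observe an index other than 0 or 1.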

Dispatch (transparent):
  AVX-512: F32x16 = __m512, mul_add → _mm512_fmadd_ps
  AVX2:    F32x8  = __m256, mul_add → _mm256_fmadd_ps
  AMX:     same F32x16 surface, tile-backed for matmul-heavy paths
  NEON:    F32x4  = float32x4_t, mul_add → vfmaq_f32
  scalar:  f32::mul_add loop fallback
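
The scalar row of the dispatch table can be sketched as below. The position update follows the `v.mul_add(dt, p)` form quoted above; applying `damping` as a per-tick velocity multiplier is an assumption, since the commit message does not spell out how damping enters the update.

```rust
/// Scalar fallback sketch: p' = v * dt + p (fused), then v' = v * damping.
/// The damping step is an assumed interpretation, not taken from the source.
pub fn integrate_scalar(positions: &mut [f32], velocities: &mut [f32], dt: f32, damping: f32) {
    assert_eq!(positions.len(), velocities.len());
    for (p, v) in positions.iter_mut().zip(velocities.iter_mut()) {
        // f32::mul_add computes (v * dt) + p with a single rounding.
        *p = v.mul_add(dt, *p);
        *v *= damping;
    }
}
```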

Tests: 11 new renderer tests; 1630 ndarray lib tests pass total
(previous 1619 + 11). Zero regressions.

Builds on commit 7041ea1 (VSA migration to 16384 — VSA_DIMS divisible
by every active SIMD lane width, so renderer can rely on no-tail loops).

https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
…ve FPS

Enhancements over the initial renderer (commit 01f4ecd):

1. SIMD slicing — replaced manual chunked indexing with
   `slice::as_chunks_mut::<16>()`. Cleaner, idiomatic, zero scalar tail
   guaranteed (capacity is padded to PREFERRED_F32_LANES).

2. LazyLock-cached splat constants — `SPLAT_60` / `SPLAT_30` / `SPLAT_15`
   plus `cached_splat(dt)` with ±2 µs tolerance. Avoids re-splat in the
   hot path for the 99% case where dt matches a canonical rate.

3. Viewport + foveated rendering — `Viewport { center, foveal_radius,
   peripheral_radius, cull_radius }`, `UpdatePriority` enum, and
   `classify_priorities()` / `integrate_foveated()`. Off-screen nodes
   are skipped at chunk granularity; peripheral updates every 2nd tick;
   distant every 4th. With a typical foveal share of 20% of nodes, a
   foveal-only pass does about 1/5 of the work of a full integrate (up to 5×).

4. FpsController — adaptive 60→30→15 with hysteresis. Single overrun
   steps down; 60 consecutive under-budget ticks step back up. EWMA
   (α = 1/8) tracks rolling mean tick duration. Auto-tunes under load
   without manual rate selection.

5. Renderer::tick_adaptive(&fps, damping) — recommended top-level entry.
   Renderer::tick_foveated(&fps, damping, viewport) — viewport-aware tick.
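
The hysteresis rule from item 4 can be sketched as follows. This is an illustrative controller under stated assumptions: durations are in microseconds, the budget is `1e6 / fps`, and the struct layout and method names other than `FpsController` are hypothetical.

```rust
/// Hypothetical adaptive rate controller: 60 -> 30 -> 15 FPS with hysteresis.
pub struct FpsController {
    rates: [u32; 3],
    level: usize,     // index into `rates`
    good_streak: u32, // consecutive under-budget ticks
    ewma_us: f32,     // rolling mean tick duration in microseconds
}

impl FpsController {
    pub fn new() -> Self {
        Self { rates: [60, 30, 15], level: 0, good_streak: 0, ewma_us: 0.0 }
    }

    pub fn fps(&self) -> u32 {
        self.rates[self.level]
    }

    pub fn budget_us(&self) -> f32 {
        1_000_000.0 / self.fps() as f32
    }

    pub fn mean_us(&self) -> f32 {
        self.ewma_us
    }

    /// Record one tick duration and adjust the target rate.
    pub fn record(&mut self, tick_us: f32) {
        // EWMA with alpha = 1/8.
        self.ewma_us += (tick_us - self.ewma_us) / 8.0;
        if tick_us > self.budget_us() {
            // A single overrun steps down immediately.
            self.good_streak = 0;
            if self.level + 1 < self.rates.len() {
                self.level += 1;
            }
        } else {
            // 60 consecutive under-budget ticks step back up.
            self.good_streak = self.good_streak.saturating_add(1);
            if self.good_streak >= 60 && self.level > 0 {
                self.level -= 1;
                self.good_streak = 0;
            }
        }
    }
}
```

The asymmetry (one overrun down, sixty good ticks up) is what prevents the rate from oscillating when the load sits near the budget.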

Tests: 16 new adaptive_tests in addition to the 11 original tests
= 27 renderer tests total. All pass. Full ndarray suite: still clean
(1646 lib tests).
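
The foveated schedule from item 3 above can be sketched per node as follows. The field names come from the commit message; the 2D `center`, the variant names of `UpdatePriority`, and the helper names `classify` and `should_update` are assumptions, and the real code works at chunk granularity rather than per node.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UpdatePriority {
    Foveal,
    Peripheral,
    Distant,
    Culled,
}

pub struct Viewport {
    pub center: [f32; 2],
    pub foveal_radius: f32,
    pub peripheral_radius: f32,
    pub cull_radius: f32,
}

/// Classify a position by its distance from the viewport center.
pub fn classify(vp: &Viewport, pos: [f32; 2]) -> UpdatePriority {
    let dx = pos[0] - vp.center[0];
    let dy = pos[1] - vp.center[1];
    let d2 = dx * dx + dy * dy; // compare squared distances, no sqrt needed
    if d2 <= vp.foveal_radius * vp.foveal_radius {
        UpdatePriority::Foveal
    } else if d2 <= vp.peripheral_radius * vp.peripheral_radius {
        UpdatePriority::Peripheral
    } else if d2 <= vp.cull_radius * vp.cull_radius {
        UpdatePriority::Distant
    } else {
        UpdatePriority::Culled
    }
}

/// Foveal every tick, peripheral every 2nd, distant every 4th, culled never.
pub fn should_update(priority: UpdatePriority, tick: u64) -> bool {
    match priority {
        UpdatePriority::Foveal => true,
        UpdatePriority::Peripheral => tick % 2 == 0,
        UpdatePriority::Distant => tick % 4 == 0,
        UpdatePriority::Culled => false,
    }
}
```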

https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
… tier-adaptive fidelity

ndarray IS the graphics card. Tier-adaptive palette where the detected
SIMD tier drives visual fidelity:

  AVX-512/AMX  → 16 colors, 4 bpp, 8×8 sprites (512 KB wire @ 1024²)
  AVX2         →  8 colors, 3 bpp, 6×6 sprites (384 KB wire)
  NEON/scalar  →  4 colors, 2 bpp, 4×4 sprites (256 KB wire)
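
The wire sizes in the table follow directly from bits per pixel. A small sketch, assuming the tier is chosen from the f32 lane count (16, 8, or fewer); the variant names and the `from_f32_lanes` / `wire_bytes` helpers are hypothetical, not the crate's `PaletteTier::detect()`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PaletteTier {
    Colors16,
    Colors8,
    Colors4,
}

impl PaletteTier {
    /// Assumed mapping: 16+ lanes -> 16 colors, 8..=15 -> 8, fewer -> 4.
    pub fn from_f32_lanes(lanes: usize) -> Self {
        match lanes {
            0..=7 => PaletteTier::Colors4,
            8..=15 => PaletteTier::Colors8,
            _ => PaletteTier::Colors16,
        }
    }

    pub fn colors(self) -> u32 {
        match self {
            PaletteTier::Colors16 => 16,
            PaletteTier::Colors8 => 8,
            PaletteTier::Colors4 => 4,
        }
    }

    /// log2 of the color count: 4, 3, or 2 bits per pixel.
    pub fn bits_per_pixel(self) -> u32 {
        self.colors().trailing_zeros()
    }

    /// Packed size in bytes for a width x height image, rounded up.
    pub fn wire_bytes(self, width: usize, height: usize) -> usize {
        (width * height * self.bits_per_pixel() as usize + 7) / 8
    }
}
```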

Uses the existing Pumpkin/Minecraft-derived primitives:
  - palette_codec.rs for variable-width index packing (pack/unpack roundtrip)
  - nibble.rs ready for 4-bit packed density fields
  - byte_scan.rs for hit-testing
  - U8x64::cmpeq_mask / shr_epi16 for SIMD nibble extract

Three views:
  - MRI — density heatmap (blit_mri_density, palette = intensity)
  - Neo4j — dot sprites at nodes + Bresenham edges (compose_neo4j)
  - Cloud — mipmap LOD pyramid (build_mipmap_pyramid, downsample_2x)

Surface:
  - Framebuffer { pixels, tier, dirty rect } + Bresenham draw_line + plot_dot
  - PaletteTier::detect() from PREFERRED_F32_LANES
  - compose_neo4j(fb, frame, edges, scale, offset, colors)
  - compose_mri(fb, frame, scale, offset)
  - build_mipmap_pyramid(fb, min_dim) → LOD chain
  - fb.pack() → palette_codec compressed wire format
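
For readers unfamiliar with the edge-drawing step, here is a minimal sketch of an integer Bresenham line on a one-byte-per-pixel buffer. The `Framebuffer` layout shown (row-major `Vec<u8>`, silent clipping in `plot`) is an assumption and omits the tier and dirty-rect fields listed above.

```rust
pub struct Framebuffer {
    pub width: usize,
    pub height: usize,
    pub pixels: Vec<u8>, // one palette index per pixel, row-major
}

impl Framebuffer {
    pub fn new(width: usize, height: usize) -> Self {
        Self { width, height, pixels: vec![0; width * height] }
    }

    /// Write one pixel; coordinates outside the buffer are ignored.
    pub fn plot(&mut self, x: i32, y: i32, color: u8) {
        if x >= 0 && y >= 0 && (x as usize) < self.width && (y as usize) < self.height {
            self.pixels[y as usize * self.width + x as usize] = color;
        }
    }

    /// Integer Bresenham line from (x0, y0) to (x1, y1), endpoints inclusive.
    pub fn draw_line(&mut self, mut x0: i32, mut y0: i32, x1: i32, y1: i32, color: u8) {
        let dx = (x1 - x0).abs();
        let dy = -(y1 - y0).abs();
        let sx = if x0 < x1 { 1 } else { -1 };
        let sy = if y0 < y1 { 1 } else { -1 };
        let mut err = dx + dy;
        loop {
            self.plot(x0, y0, color);
            if x0 == x1 && y0 == y1 {
                break;
            }
            let e2 = 2 * err;
            if e2 >= dy {
                err += dy;
                x0 += sx;
            }
            if e2 <= dx {
                err += dx;
                y0 += sy;
            }
        }
    }
}
```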

Mipmap LOD chain maps to the pyramid-cache hierarchy (EPIPHANIES.md):
  L0 (1024²) = 1 MB → L2 cache
  L1 (256²)  = 64 KB → L1 cache
  L2 (64²)   = 4 KB  → L0/registers
  L3 (16²)   = 256 B → inline
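
A sketch of how such an LOD chain could be built, assuming square, one-byte-per-pixel levels and a rounded 2×2 box average. Averaging is only meaningful when the palette index encodes intensity (as in the MRI view). Note that this sketch halves the side length per level, whereas the listing above steps by 4× per side; how the crate maps halvings to the listed levels is not stated, so treat `build_pyramid` as hypothetical.

```rust
/// Halve a w x h image by averaging each 2x2 block (rounded to nearest).
pub fn downsample_2x(src: &[u8], w: usize, h: usize) -> Vec<u8> {
    assert_eq!(src.len(), w * h);
    assert!(w % 2 == 0 && h % 2 == 0, "dimensions must be even");
    let (ow, oh) = (w / 2, h / 2);
    let mut out = vec![0u8; ow * oh];
    for y in 0..oh {
        for x in 0..ow {
            let i = 2 * y * w + 2 * x;
            let sum = src[i] as u16 + src[i + 1] as u16 + src[i + w] as u16 + src[i + w + 1] as u16;
            out[y * ow + x] = ((sum + 2) / 4) as u8;
        }
    }
    out
}

/// Build a chain of square levels, halving until the next level
/// would be smaller than `min_dim`. Returns (side length, pixels) pairs.
pub fn build_pyramid(base: Vec<u8>, dim: usize, min_dim: usize) -> Vec<(usize, Vec<u8>)> {
    let mut levels = vec![(dim, base)];
    loop {
        let (d, px) = levels.last().unwrap();
        let d = *d;
        if d % 2 != 0 || d / 2 < min_dim.max(1) {
            break;
        }
        let next = downsample_2x(px, d, d);
        levels.push((d / 2, next));
    }
    levels
}
```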

Tests: 16 new framebuffer tests. All pass. Full suite: 1698 lib tests.

https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
…yby ring

Demoscene-inspired visual enhancements for the Minecraft-style renderer:

1. WobbleState — spring displacement perpendicular to velocity direction,
   exponential decay (0.92/tick). Injects on high-velocity nodes. Masks
   layout jitter, makes the graph feel alive. Deterministic (no RNG).

2. FireState — per-node [0,255] intensity. Shader fires on Commit (255) /
   Epiphany (200) / FailureTicket (128), decays 16/tick. Maps to palette
   color boost (additive blend clamped to palette max).

3. GLYPH_ATLAS — 5×7 bitmap font covering A-Z, 0-9, punctuation. 128
   entries × 5 bytes = 640 bytes total, fits L1. Column-major for
   efficient vertical scanline blit. draw_label() renders at any (x,y).

4. FlybyCache — Amiga-style pre-rendered ring buffer. Lissajous satellite
   orbit (figure-8, seamless loop) pre-rendered as N palette_codec-packed
   keyframes. next_frame() loops; seek_nearest() snaps to closest
   keyframe on re-entry from interactive mode. 300 frames × 512 KB
   (16-color 1024²) = 150 MB; 300 frames × 128 KB (512²) = 38 MB.

5. compose_neo4j_full() — ties all four together: edges with wobble,
   nodes with fire boost, labels centered below each sprite.
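
The decay rules in items 1 and 2 can be sketched as below. The 16/tick fire decay and 0.92/tick wobble decay come from the text; mapping intensity to a palette boost via `intensity >> 6` (0 to 3 steps) is purely an assumption for illustration, as are the method names.

```rust
/// Per-node fire intensity in [0, 255].
pub struct FireState {
    intensity: Vec<u8>,
}

impl FireState {
    pub fn new(nodes: usize) -> Self {
        Self { intensity: vec![0; nodes] }
    }

    /// Raise a node to at least `level` (e.g. 255, 200, or 128).
    pub fn fire(&mut self, node: usize, level: u8) {
        let slot = &mut self.intensity[node];
        *slot = (*slot).max(level);
    }

    /// Linear decay of 16 per tick, clamped at zero.
    pub fn decay(&mut self) {
        for v in &mut self.intensity {
            *v = v.saturating_sub(16);
        }
    }

    pub fn intensity(&self, node: usize) -> u8 {
        self.intensity[node]
    }

    /// Additive boost clamped to the palette maximum.
    /// Assumed mapping: every 64 intensity units add one palette step.
    pub fn boosted_color(&self, node: usize, base: u8, palette_max: u8) -> u8 {
        let boost = self.intensity[node] >> 6;
        base.saturating_add(boost).min(palette_max)
    }
}

/// One tick of exponential wobble decay.
pub fn wobble_decay(displacement: f32) -> f32 {
    displacement * 0.92
}
```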

Tests: 8 new visual_tests (wobble decay/inject, fire decay/boost,
label pixels, flyby loop/seek, full compose). 24 total framebuffer
tests pass. Module is now 1032 LOC.
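
The flyby ring buffer from item 4 can be sketched as follows. The figure-8 parameterization `(sin t, sin 2t)` and the nearest-keyframe search by camera position are assumptions about how `next_frame()` and `seek_nearest()` behave; the signatures shown are hypothetical.

```rust
use std::f32::consts::TAU;

/// Camera position for keyframe `i` of `n` on a figure-8 Lissajous curve.
pub fn lissajous(i: usize, n: usize) -> (f32, f32) {
    let t = TAU * i as f32 / n as f32;
    (t.sin(), (2.0 * t).sin())
}

pub struct FlybyCache {
    frames: Vec<Vec<u8>>, // one packed keyframe per entry
    cursor: usize,
}

impl FlybyCache {
    pub fn new(frames: Vec<Vec<u8>>) -> Self {
        assert!(!frames.is_empty(), "need at least one keyframe");
        Self { frames, cursor: 0 }
    }

    /// Return the current keyframe and advance, wrapping at the end.
    pub fn next_frame(&mut self) -> &[u8] {
        let i = self.cursor;
        self.cursor = (i + 1) % self.frames.len();
        &self.frames[i]
    }

    /// Move the cursor to the keyframe whose camera position is closest to `cam`.
    pub fn seek_nearest(&mut self, cam: (f32, f32)) -> usize {
        let n = self.frames.len();
        let mut best = 0;
        let mut best_d2 = f32::INFINITY;
        for i in 0..n {
            let (x, y) = lissajous(i, n);
            let d2 = (x - cam.0).powi(2) + (y - cam.1).powi(2);
            if d2 < best_d2 {
                best = i;
                best_d2 = d2;
            }
        }
        self.cursor = best;
        best
    }
}
```

Because a figure-8 passes through its center twice per loop, a position-only search is ambiguous there; this sketch simply keeps the first match.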

https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
…aligned levels

The inverse Stufenpyramide IS a GPU shader pipeline, made visible:

  L1 (64²)    → 4 KB    → registers/L0     ← inject here
  L2 (256²)   → 64 KB   → L1 data cache    ← cascade up
  L3 (1024²)  → 1 MB    → L2 cache         ← cascade up
  L4 (2048²)  → 4 MB    → L3 cache         ← output surface

PyramidShader::inject(x, y, intensity) drops heat at L1.
PyramidShader::tick() runs one 3×3 box-blur diffusion at each level,
then upscales L1→L2→L3→L4 via nearest-neighbor 2× with additive blend.
Global decay on L4 prevents saturation. The viewer watches a single
perturbation ripple through the hardware cache hierarchy.

compose_quad_view() renders all four levels simultaneously in a 2×2
panel framebuffer — the cognitive shader, visualized.

Also: diffuse_step (3×3 box blur), upscale_2x, blit_scaled.
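
The two building blocks named above can be sketched as follows. Edge handling (clamping coordinates to the border) and the saturating additive blend are assumptions; the function names `diffuse_step_ref` and `upscale_2x_add` are hypothetical stand-ins for `diffuse_step` and `upscale_2x`.

```rust
/// One 3x3 box-blur pass; out-of-range neighbors are clamped to the border.
pub fn diffuse_step_ref(src: &[u8], w: usize, h: usize) -> Vec<u8> {
    assert_eq!(src.len(), w * h);
    let mut out = vec![0u8; w * h];
    for y in 0..h {
        for x in 0..w {
            let mut sum = 0u32;
            for dy in -1i32..=1 {
                for dx in -1i32..=1 {
                    let sx = (x as i32 + dx).clamp(0, w as i32 - 1) as usize;
                    let sy = (y as i32 + dy).clamp(0, h as i32 - 1) as usize;
                    sum += src[sy * w + sx] as u32;
                }
            }
            out[y * w + x] = (sum / 9) as u8;
        }
    }
    out
}

/// Nearest-neighbor 2x upscale of `src` (w x h), added into `dst` (2w x 2h)
/// with saturation so bright areas clamp at 255 instead of wrapping.
pub fn upscale_2x_add(src: &[u8], w: usize, h: usize, dst: &mut [u8]) {
    assert_eq!(src.len(), w * h);
    assert_eq!(dst.len(), 4 * w * h);
    for y in 0..2 * h {
        for x in 0..2 * w {
            let s = src[(y / 2) * w + x / 2];
            let d = &mut dst[y * 2 * w + x];
            *d = d.saturating_add(s);
        }
    }
}
```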

Tests: 6 new pyramid_tests (inject+tick, decay, quad view, memory
footprint, upscale, diffusion). 30 total framebuffer tests. Module is
now 1303 LOC.

Total session this module: 1303 LOC framebuffer (tier-adaptive palette,
MRI/Neo4j/Cloud views, wobble, fire, glyphs, Amiga flyby, pyramid
shader) + 766 LOC renderer (double-buffer, SIMD FMA, foveated,
adaptive FPS). 2069 LOC total rendering pipeline. 57 tests pass.

https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
Per the seismon session wishlist, 7 new methods on U8x64 across all three
SIMD backends (AVX-512 native / AVX2 scalar / scalar fallback):

Tier 1 (rasterizer core):
  pairwise_avg    → _mm512_avg_epu8   — mipmap 4x4 downsample in 2 ops
  cmpgt_mask      → _mm512_cmpgt_epu8_mask — threshold/Z-test/hit-test
  mask_blend      → _mm512_mask_blend_epi8 — sprite alpha blit
  shl_epi16       → _mm512_slli_epi16 — nibble write (completes shr pair)

Tier 2 (sprite blit + palette):
  mask_store      → _mm512_mask_storeu_epi8 — partial-tile edge writes
  saturating_add  → _mm512_adds_epu8  — additive blend (completes sub pair)
  permute_bytes   → _mm512_permutexvar_epi8 — cross-lane byte shuffle

All methods have matching scalar fallbacks in simd.rs and simd_avx2.rs
for NEON/non-AVX512 targets. Consumer writes crate::simd::U8x64 — the
polyfill picks the path.
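
For reference, the lane-wise semantics of several of these operations can be written as a scalar model. The sketch below follows the documented behavior of the named AVX-512 intrinsics (rounded average, unsigned compare to a 64-bit mask, mask-selected blend, saturating add, index-driven byte permute); the struct and method signatures are assumptions and may differ from the crate's `U8x64`.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct U8x64(pub [u8; 64]);

impl U8x64 {
    pub fn splat(v: u8) -> Self {
        Self([v; 64])
    }

    /// Rounded average per lane, as in _mm512_avg_epu8: (a + b + 1) >> 1.
    pub fn pairwise_avg(self, o: Self) -> Self {
        let mut out = [0u8; 64];
        for i in 0..64 {
            out[i] = ((self.0[i] as u16 + o.0[i] as u16 + 1) >> 1) as u8;
        }
        Self(out)
    }

    /// Bit i of the result is set when self[i] > o[i] (unsigned compare).
    pub fn cmpgt_mask(self, o: Self) -> u64 {
        let mut mask = 0u64;
        for i in 0..64 {
            if self.0[i] > o.0[i] {
                mask |= 1u64 << i;
            }
        }
        mask
    }

    /// Lane i takes b[i] where the mask bit is set, otherwise a[i].
    pub fn mask_blend(mask: u64, a: Self, b: Self) -> Self {
        let mut out = a.0;
        for i in 0..64 {
            if (mask >> i) & 1 == 1 {
                out[i] = b.0[i];
            }
        }
        Self(out)
    }

    /// Per-lane add that clamps at 255.
    pub fn saturating_add(self, o: Self) -> Self {
        let mut out = [0u8; 64];
        for i in 0..64 {
            out[i] = self.0[i].saturating_add(o.0[i]);
        }
        Self(out)
    }

    /// out[i] = self[idx[i] mod 64]: a full cross-lane byte shuffle.
    pub fn permute_bytes(self, idx: Self) -> Self {
        let mut out = [0u8; 64];
        for i in 0..64 {
            out[i] = self.0[(idx.0[i] & 63) as usize];
        }
        Self(out)
    }
}
```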

Tests: 9 new u8x64_rasterizer_tests (pairwise_avg ×2, cmpgt_mask,
mask_blend, shl_epi16, saturating_add ×2, permute_bytes ×2). All pass.

https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
@AdaWorldAPI AdaWorldAPI merged commit d4da568 into master Apr 26, 2026
5 of 14 checks passed

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1f224baee8

Comment thread src/hpc/renderer.rs
Comment on lines +188 to +191
let mut back = self.write_back();
let RenderFrame { positions, velocities, tick, .. } = &mut *back;
integrate_simd(positions, velocities, dt, damping);
*tick = self.tick_count.load(Ordering::Acquire) + 1;

[P1] Seed back buffer from front before advancing tick

tick() integrates the current back frame in place but never copies state from the current front frame first. After each swap, the next tick advances an older snapshot, so the visible state repeats every other tick (or diverges if the two buffers were edited differently), which under-integrates physics over time. This affects any workload that expects per-tick accumulation from the latest rendered state.
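
The effect described in this comment can be reproduced with a toy two-buffer model. This is a sketch under simplifying assumptions (constant velocity, plain `Vec<f32>` buffers, no atomics); `Frames` and `seed_from_front` are hypothetical names, not part of the PR.

```rust
/// Toy double buffer: two position arrays and a front index.
pub struct Frames {
    bufs: [Vec<f32>; 2],
    front: usize,
}

impl Frames {
    pub fn new(init: Vec<f32>) -> Self {
        Self { bufs: [init.clone(), init], front: 0 }
    }

    pub fn front(&self) -> &[f32] {
        &self.bufs[self.front]
    }

    /// Advance the back buffer by velocity * dt, then swap.
    /// With `seed_from_front`, the back buffer first copies the latest state.
    pub fn tick(&mut self, velocity: f32, dt: f32, seed_from_front: bool) {
        let (a, b) = self.bufs.split_at_mut(1);
        let (front, back) = if self.front == 0 {
            (&a[0], &mut b[0])
        } else {
            (&b[0], &mut a[0])
        };
        if seed_from_front {
            back.copy_from_slice(front);
        }
        for p in back.iter_mut() {
            *p = velocity.mul_add(dt, *p);
        }
        self.front ^= 1;
    }
}
```

Starting from 0 with velocity 1 and dt 1, four seeded ticks reach 4.0, while four unseeded ticks only reach 2.0 because each buffer advances on every other tick.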

Comment thread src/hpc/renderer.rs
Comment on lines +237 to +239
let (p_chunks, p_tail) = positions.as_chunks_mut::<16>();
let (v_chunks, v_tail) = velocities.as_chunks_mut::<16>();
debug_assert!(p_tail.is_empty() && v_tail.is_empty());

[P1] Align integration chunking with non-AVX512 lane settings

This path hard-codes 16-float chunks, but frame allocation is padded with PREFERRED_F32_LANES (8 on AVX2, 4 on NEON). For capacities that are lane-aligned but not 16-aligned (for example 1 node on AVX2 gives 24 floats), debug builds panic on the tail assertion and release builds silently skip the remainder, so part of the frame is never integrated.
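
One possible way to address this, sketched under assumptions: make the chunk width a const generic and process any remainder in a scalar tail instead of asserting it is empty. `integrate_chunked` is a hypothetical name; it uses the long-stable `chunks_exact_mut` rather than `as_chunks_mut`, and the inner loop stands in for the SIMD body.

```rust
/// Integrate p += v * dt in LANES-wide chunks, then handle any leftover
/// elements one at a time so no part of the frame is skipped.
pub fn integrate_chunked<const LANES: usize>(positions: &mut [f32], velocities: &[f32], dt: f32) {
    assert!(LANES > 0);
    assert_eq!(positions.len(), velocities.len());
    let mut p_chunks = positions.chunks_exact_mut(LANES);
    let mut v_chunks = velocities.chunks_exact(LANES);
    for (p, v) in (&mut p_chunks).zip(&mut v_chunks) {
        // Stand-in for one SIMD register worth of work.
        for i in 0..LANES {
            p[i] = v[i].mul_add(dt, p[i]);
        }
    }
    // Scalar tail: empty when the length is a multiple of LANES.
    for (p, v) in p_chunks.into_remainder().iter_mut().zip(v_chunks.remainder()) {
        *p = v.mul_add(dt, *p);
    }
}
```

With the 24-float example from the comment, `integrate_chunked::<16>` updates one full chunk plus an 8-element tail, and `integrate_chunked::<8>` updates three full chunks with no tail.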

Comment thread src/simd_avx512.rs
Comment on lines +666 to +667
8 => _mm512_slli_epi16(self.0, 8),
_ => _mm512_setzero_si512(),

[P2] Support full 0..15 shifts in AVX-512 shl_epi16

The AVX-512 implementation returns zero for every shift not in 1..=8, while the scalar and AVX2 backends handle any shift <16. That creates backend-dependent behavior for imm=0 and imm=9..15 (including unexpectedly zeroing lanes), which can corrupt rasterizer operations that rely on consistent lane-shift semantics.
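
A scalar reference for the semantics the comment asks for, which could serve as a cross-backend test oracle. It assumes little-endian 16-bit lanes over a 64-byte vector and follows the `_mm512_slli_epi16` rule that shift counts above 15 produce zero; `shl_epi16_ref` is a hypothetical helper name.

```rust
/// Shift each of the 32 little-endian u16 lanes left by `imm` bits.
/// Counts 0..=15 shift normally; anything larger yields all zeros.
pub fn shl_epi16_ref(bytes: [u8; 64], imm: u32) -> [u8; 64] {
    let mut out = [0u8; 64];
    if imm > 15 {
        return out;
    }
    for lane in 0..32 {
        let v = u16::from_le_bytes([bytes[2 * lane], bytes[2 * lane + 1]]);
        // Bits shifted past bit 15 are discarded.
        let shifted = (v << imm).to_le_bytes();
        out[2 * lane] = shifted[0];
        out[2 * lane + 1] = shifted[1];
    }
    out
}
```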

@AdaWorldAPI AdaWorldAPI changed the title feat: SIMD rendering pipeline + VSA 16384 migration + rasterizer intrinsics feat: SIMD rendering pipeline + VSA 16384 + rasterizer intrinsics + Dockerfile docs Apr 26, 2026
@AdaWorldAPI AdaWorldAPI changed the title feat: SIMD rendering pipeline + VSA 16384 + rasterizer intrinsics + Dockerfile docs feat: SIMD rendering pipeline + VSA 16384 migration + rasterizer intrinsics Apr 26, 2026